#712 Read Variant metadata offset-size from the correct header bits #716
Merged
iifawzi
reviewed
Jun 27, 2026
  // bit 4: sorted_strings flag
- // bit 5-6: offset_size_minus_one (0..3 → 1..4 bytes)
- // bit 7: unused
+ // bit 5: unused
Contributor
noice noice, i was confused as it's not specified in the docs, but found the correction here apache/parquet-format#574
iifawzi
approved these changes
Jun 27, 2026
iifawzi
left a comment
Contributor
Looks good to me, nice catch, seems like it was gunnar vs variant today, seeing a couple of bugs!
Collaborator
Author
ca13218 to
ef422dc
The metadata header's offset_size_minus_one field lives in bits 6-7, but it was read from bits 5-6. Every existing fixture uses offset_size=1, where both readings yield 0, so the bug stayed invisible across the whole suite; any Variant whose dictionary string section exceeds 255 bytes (needing offset_size >= 2) was misparsed into garbage.

Also guard a 4-byte dictionary_size that reads back as a negative int, and enrich variant decode errors raised on the top-level read path with the originating [fileName] prefix so they are attributable like other read errors.

Regression coverage uses a real file generated via simple-datagen (a 320-byte dictionary needing offset_size=2) read end-to-end through the VARIANT row API, plus a corrupt-metadata fixture asserting the enriched message.

The bit-layout comment and class @see cite the rendered Variant spec, and a stale PqVariantObjectImpl field-lookup javadoc is corrected.
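To make the off-by-one-bit failure concrete, here is a minimal standalone sketch. It is not the project's VariantBinary code: the class name, the helper names, and the 0x3 mask are assumptions for illustration. Only the shift values 5 and 6 and the bit positions (version in bits 0-3, sorted_strings in bit 4, offset_size_minus_one in bits 6-7) come from the text above.

```java
public class MetadataHeaderSketch {
    // Hypothetical constants; the PR only names METADATA_OFFSET_SIZE_SHIFT (5 before, 6 after).
    static final int BUGGY_SHIFT = 5;
    static final int FIXED_SHIFT = 6;
    static final int OFFSET_SIZE_MASK = 0x3; // assumed 2-bit mask

    /** Decodes offset_size (1..4 bytes) from a metadata header byte using the given shift. */
    static int offsetSize(int header, int shift) {
        return ((header >>> shift) & OFFSET_SIZE_MASK) + 1;
    }

    /** Builds a header byte: version in bits 0-3, sorted_strings in bit 4, offset_size_minus_one in bits 6-7. */
    static int header(int version, boolean sorted, int offsetSize) {
        return (version & 0xF) | (sorted ? 1 << 4 : 0) | ((offsetSize - 1) << 6);
    }

    /** Smallest offset width in bytes that can represent maxOffset. */
    static int minOffsetSize(int maxOffset) {
        if (maxOffset <= 0xFF) return 1;
        if (maxOffset <= 0xFFFF) return 2;
        if (maxOffset <= 0xFFFFFF) return 3;
        return 4;
    }

    public static void main(String[] args) {
        for (int size = 1; size <= 4; size++) {
            int h = header(1, false, size);
            System.out.printf("offset_size=%d header=0x%02X shift5=%d shift6=%d%n",
                    size, h, offsetSize(h, BUGGY_SHIFT), offsetSize(h, FIXED_SHIFT));
        }
        // A 320-byte string section has a final offset of 320, which does not fit in one byte.
        System.out.println("min offset size for 320 bytes: " + minOffsetSize(320));
    }
}
```

In this sketch, offset_size=1 decodes identically under both shifts, which is why fixtures limited to that width could not expose the bug. Header 0x41 (offset_size=2) decodes as 3 under shift 5, and header 0xC1 (offset_size=4) also decodes as 3.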
f316c79 to
0cb363f
Fixes #712.

Problem

The Variant metadata header's offset_size_minus_one field lives in bits 6-7, but VariantBinary read it from bits 5-6 (METADATA_OFFSET_SIZE_SHIFT = 5). Every existing fixture uses offset_size = 1, where both readings yield 0, so the bug was invisible across the whole suite (125 byte-exact shredded-variant cases + the comparison sweep). Any Variant whose dictionary string section exceeds 255 bytes, and therefore needs offset_size >= 2, was misparsed into garbage.

It surfaced via the negative_dictionary_size fixture from apache/parquet-testing#113, which uses offset_size = 4.

Fix

- METADATA_OFFSET_SIZE_SHIFT 5 → 6 (+ corrected the layout comment).
- Guard a 4-byte dictionary_size that reads back as a negative int: reject with a clear "not a valid unsigned int" message rather than letting the bogus size drive later arithmetic. (The (dictSize+1)*offsetSize overflow widening is intentionally left to #713, "Unguarded 32-bit overflow in Variant object/array offset arithmetic", to avoid overlap.)

Tests

- New fixture generated via tools/simple-datagen.py: variant_metadata_offset_size2.parquet, a VARIANT object whose metadata dictionary is 320 bytes, forcing offset_size = 2.
- VariantLogicalTypeTest.readsVariantWhoseMetadataUsesOffsetSizeTwo reads it through the VARIANT row API and resolves both long-named fields ('a'*160 → INT8 5, 'b'*160 → BOOLEAN_TRUE).
- VariantMetadataTest.negativeDictionarySizeRejected for the guard.

Notes

- Changes are confined to internal packages: no public API or config change, so no usage-docs update.
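The negative dictionary_size guard listed under Fix can be sketched as follows. This is an illustration only, not the code from the PR: the class, the method names, and the use of IllegalArgumentException are assumptions, and only the "not a valid unsigned int" wording is taken from the description. It assumes dictionary_size starts at byte 1, right after the header byte, and is stored little-endian in offset_size bytes.

```java
public class DictionarySizeGuardSketch {
    /** Reads a little-endian unsigned integer of 1..4 bytes starting at pos into a Java int. */
    static int readUnsignedLE(byte[] buf, int pos, int width) {
        int value = 0;
        for (int i = 0; i < width; i++) {
            value |= (buf[pos + i] & 0xFF) << (8 * i);
        }
        return value; // a 4-byte value with the top bit set comes back negative
    }

    /** Reads dictionary_size and rejects values that do not fit in a non-negative int. */
    static int readDictionarySize(byte[] metadata, int offsetSize) {
        int dictSize = readUnsignedLE(metadata, 1, offsetSize);
        if (dictSize < 0) {
            // Hypothetical exception type; the message wording follows the PR description.
            throw new IllegalArgumentException(
                    "dictionary_size 0x" + Integer.toHexString(dictSize) + " is not a valid unsigned int");
        }
        return dictSize;
    }

    public static void main(String[] args) {
        byte[] ok = {(byte) 0x41, 0x02, 0x00};
        System.out.println(readDictionarySize(ok, 2));

        byte[] corrupt = {(byte) 0xC1, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        try {
            readDictionarySize(corrupt, 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting the value at the read site keeps a bogus size out of later offset arithmetic. The separate (dictSize+1)*offsetSize widening is out of scope here, as it is in the PR.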